New Tools for Phylogenetic Reconstruction
using Character State Trees


A. TIEFENBRUNNER, M. TIEFENBRUNNER & W. TIEFENBRUNNER 


Abstract: From the very beginning of algorithm-supported phylogenetic
reconstruction, character state trees were seen as an important tool. Later this
changed, owing to a shift of interest towards molecular biological data and, we believe,
because no simple representation of character state trees within the taxon/character
matrices that are fundamental to any algorithmic phylogenetic reconstruction had been
developed.


Here we present a new algorithm for Camin-Sokal parsimony and some new tools that
ease computing and simplify the representation of states within the taxon/character
matrix, even for very complex character state trees. The method can be used not only
for morphological data but also as an aid for combining cladograms that were developed
from other data, e.g. molecular biological data. Software that uses these tools is
available as freeware from the corresponding author.


Key words: Phylogenetic reconstruction, character state trees, Camin-Sokal
parsimony.


Introduction 


Only shared derived character states should be used to prove a close relationship between
taxa (HENNIG 1966). This is the most fundamental principle of "post-Darwinian",
phylogenetic systematics. It is well known that failing to distinguish between primitive
(plesiomorphic) and derived (apomorphic) states, ignoring their stepwise evolution, and
not dealing with transition series or, if the evolution of states involved ramifications,
with character state trees (CSTs), leads to a "pre-Darwinian" systematics that rests
upon general similarity and not on phylogenetics. This of course remains true
whether or not the reconstruction is done by computer algorithms.


Taking this into account, it is surprising that some of the most popular methods of
"phylogenetic" reconstruction, e.g. "maximum likelihood", "Wagner parsimony" or Bayesian
approaches, are not able to use existing information about the historical order of
character states and thus create similarity trees instead of phylogenetic ones. The use
of such methods has been criticised repeatedly. As an example we cite BERGSTROM &
XIANGUANG 1998: "Characters used without an understanding of their historical order
will most probably distort any cladogram. This is why computer programs for cladograms
are very dangerous tools for those who think they have found a shortcut to map
phylogeny".
The algorithm-supported reconstruction of phylogeny started in accordance with the
fundamental principle of phylogenetic systematics, even before Hennig formulated it.
The first algorithm for the reconstruction of phylogenetic relationships (a clustering
procedure later named tree popping) was developed by Konrad Lorenz in 1941. He applied
it to a species/characters matrix with 48 characters (mainly behavioural ones) and
20 anatid species. Of course it was not an algorithm for a computer program (which may
be the reason why his contribution fell into oblivion), because fast, generally available
computers did not exist at that time. Lorenz instead used a wire model to test his
algorithm. Most characters were binary, with a known historical sequence, one state being
primitive relative to the other. This is the simplest possible CST. A state of a character
can be primitive relative to a second state and at the same time derived relative to a
third. This leads to CSTs that may be very complex, as shown in fig. 1.
Fig. 1: Evolution of the ovipositor and the ootheca of Dictyoptera (Insecta) as an example of a
relatively complex character state tree (based on data of GRIMALDI & ENGEL 2005 and EHRMANN
2002). Each node of the tree corresponds to an observed character state.

When CAMIN & SOKAL (1965) started a remarkable experiment to find out how
talented taxonomists successfully reconstruct phylogenetic relationships, they not only
invented the "maximum parsimony" method of phylogenetic reconstruction but were also
aware of the significance of CSTs. In practice they used a simplified version. Using a
method developed by SOKAL & SNEATH 1963 for dividing complex CSTs into
several binary factors (simple "characters" with only two states, one of which is more
derived than the other), KLUGE & FARRIS 1969 were able to develop the idea of
Camin and Sokal further. Nevertheless, the coding of the states remained uninformative
about the relative position of the states in the original CST. No
algorithm for the automatic division of a CST into binary factors was presented, so
the creation of complex CSTs was not supported.


At about this time (1968) Estabrook published a coding system that gives information
about the relative position of the states in a CST, but it was "only" used for progress in
theory, not for practical purposes (e.g. use in a species/characters matrix).


Later the Sankoff algorithm was developed (SANKOFF & CEDERGREN 1983), which
also allows the use of CSTs. Unfortunately, with this algorithm the computational
effort increases with the square of the number of states. Therefore only simple CSTs can
be utilized.
As a consequence, although phylogenetic reconstruction software exists that deals with
linear transition series, to our knowledge no computer programs are currently available
that utilize CSTs of any complexity. In this article we present a new algorithm that allows
CSTs of any complexity to be used for the creation of cladograms without the need to
divide them into binary factors, and with a computational effort that increases only
linearly with the number of states. Free software that uses this algorithm is available.


Camin-Sokal parsimony and Character State Trees


CAMIN & SOKAL 1965 used a simple notation to symbolize the structure of a CST and to
compute the evolutionary distances between states. Unfortunately it has the disadvantage
that only a single bifurcation, at the root, is possible. The root is labelled 0; nodes to
the left are decremented by one at each step, nodes to the right are incremented by one,
giving for instance -2, -1, 0, 1 (fig. 2a). This system makes the computation of
evolutionary distances between two states very easy. If more than one bifurcation of a CST
is desired, this advantage is lost and additional symbols are necessary, e.g. 1' and 1".
KLUGE & FARRIS 1969 solved this problem by splitting a character into the necessary
number of binary "subcharacters", called "factors" (fig. 2c). Of course, the need to
split does not encourage the creation of complex CSTs.
Fig. 2a-c: The same character state tree with Camin-Sokal (a) and matrioshka (b) coding, and (c)
as a combination of three subcharacters (factors), each one with two states.

Another way to deal with CSTs of any complexity is to use the Sankoff algorithm
(SANKOFF & CEDERGREN 1983) for non-uniformly weighted states (FELSENSTEIN 2004).
The cost of a substitution from a primitive to a more derived state depends on the
distance between the two states on the CST; the cost of a substitution in the opposite
direction is infinite. For the example used in fig. 2, using the notation of fig. 2a, the
cost matrix would be:

Tab. 1: Cost matrix for Camin-Sokal parsimony using the Sankoff algorithm and the CST of fig. 2a. 


                          to descendant
                      1     0    -1    -2
from ancestor    1    0     ∞     ∞     ∞
                 0    1     0     1     2
                -1    ∞     ∞     0     1
                -2    ∞     ∞     ∞     0
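The cost matrix of tab. 1 can be derived mechanically from the tree. The following is a minimal sketch, assuming that the CST of fig. 2a is stored as a mapping from each state to its immediate ancestor (the names PARENT, path_to_root and cost_matrix are ours and purely illustrative). It also shows why the effort grows with the square of the number of states: one entry is needed for every ordered pair of states.

```python
import math

# Assumed encoding of the CST of fig. 2a: each state maps to its
# immediate ancestral state; the root (state 0) maps to None.
PARENT = {0: None, 1: 0, -1: 0, -2: -1}

def path_to_root(state, parent):
    """Return the states visited when walking from `state` to the root."""
    path = [state]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def cost_matrix(parent):
    """cost[a][d]: steps from ancestor a to descendant d, infinite otherwise."""
    cost = {}
    for a in parent:
        cost[a] = {}
        for d in parent:
            path = path_to_root(d, parent)
            # a change a -> d is allowed only if a lies on the path from d to the root
            cost[a][d] = path.index(a) if a in path else math.inf
    return cost

COST = cost_matrix(PARENT)
for a in (1, 0, -1, -2):
    print(a, [COST[a][d] for d in (1, 0, -1, -2)])
```

The printed rows correspond to the rows of tab. 1.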




















Using the Sankoff algorithm, the computational effort increases with the square of the
number of character states of a CST (WILLIAMS & FITCH 1990). Once again, this does not
encourage the creation of complex CSTs. What we need is a simple algorithm whose
computational effort increases only linearly with the number of character
states. To reach this aim we further require


- a basic notation to describe a CST, and
- elementary tools for calculation.


• Matrioshka operator


To symbolize the structure of a CST, we use an operator that connects two states. It is a
pointer that points from one state to its immediate ancestral neighbour. We call it the
"matrioshka operator".


For example, in fig. 2b C points at B: C[B]. The immediate primitive state is written in
brackets, which of course is an arbitrary convention. The name of the operator refers to
the famous Russian doll that contains a doll that contains a doll, and so on. The reason
becomes obvious if we write C in this way: C[B[A[]]]. This is called the path from C
to the root. To know the whole structure of the tree, we only need to know, for each
state, its immediate more primitive neighbour. Thus this operator is appropriate
for use within the taxon/character matrix (tab. 2).
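The nesting can be made concrete with a few lines of code. This is a minimal sketch, assuming the tree of fig. 2b is the one implied by the examples in the text (B points at A, C at B, D at A); the mapping PARENT and the function nested_path are illustrative names, not part of the authors' software.

```python
# Assumed encoding of fig. 2b: each state points at its immediate
# ancestor; the root A points at nowhere (None).
PARENT = {"A": None, "B": "A", "C": "B", "D": "A"}

def nested_path(state, parent):
    """Write the path from `state` to the root in nested bracket form."""
    ancestor = parent[state]
    inner = "" if ancestor is None else nested_path(ancestor, parent)
    return f"{state}[{inner}]"

print(nested_path("C", PARENT))  # C[B[A[]]]
print(nested_path("D", PARENT))  # D[A[]]
```

Only one pointer per state is stored, yet the full path of every state can be recovered from it.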


Tab. 2: Example of a species/characters matrix with matrioshka coding. In character 4 there
are more states in the character state tree than species in the matrix. Because the description of the
character state tree within the matrix must be complete, it is necessary here to connect more than
two states with the matrioshka operator. Redundancy within the matrix is allowed (in character 3
the connection "State3[State2]" appears twice, although the second time "State3" alone would give
enough information) but not necessary (characters 2 and 4). The root of the CST points at
nowhere; its brackets remain empty.


            Char 1    Char 2    Char 3            Char 4
Species 1   A[]       0         State1[]          D[B[A[]]]
Species 2   B[A]      0         State2[State1]    E[B]
Species 3   C[B]      0         State3[State2]    F[C[A]]
Species 4   D[A]      1[0]      State3[State2]    G[C]
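How one column of such a matrix yields a complete CST can be sketched as follows. The parser below is our own illustration, not the authors' implementation; it assumes entries are written exactly as in character 4 of tab. 2 and that state names contain no brackets.

```python
def parse_entry(entry, parent):
    """Record the pointers of one matrix entry in `parent`; return the taxon's state."""
    # "D[B[A[]]]" -> ["D", "B", "A", ""]; the empty string marks the root's empty brackets
    names = entry.strip().rstrip("]").split("[")
    for state, ancestor in zip(names, names[1:]):
        pointer = ancestor or None
        if parent.get(state, pointer) != pointer:
            raise ValueError(f"{state} points at more than one state")
        parent[state] = pointer
    return names[0]

COLUMN = ["D[B[A[]]]", "E[B]", "F[C[A]]", "G[C]"]  # character 4 of tab. 2
parent = {}
states = [parse_entry(entry, parent) for entry in COLUMN]
print(states)  # state of each species
print(parent)  # one pointer per state of the CST
```

Four entries are enough to describe a tree with seven states, because each entry may carry part of its own path to the root.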


When a CST is created, the following rules must hold for matrioshka states:


1. Uniqueness of the root. There is exactly one state that points at nowhere, the
root of the CST. If the most primitive state is for instance A, we write A[].

2. Prohibition of self-reference. No state points at itself: A[A] is forbidden.

3. Prohibition of cyclic reference. A[A] is a special case of a loop connection.
Cyclic references are generally forbidden, e.g. B[A] together with A[B].

4. Uniqueness of the reference. A state cannot point at more than one other state:
C[A, B] is forbidden.

5. In contrast to rule 4, any number of states may of course point at one
state, as long as rules 1 to 4 are not violated. Any number of branches can sprout
from one node of a CST (see also ESTABROOK 1968).
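Rules 1 to 4 are easy to check automatically. The following sketch assumes that the pointers collected from a matrix column are available as (state, ancestor) pairs, with None standing for the empty brackets of the root; check_cst and its message texts are hypothetical.

```python
def check_cst(pointers):
    """Return a list of rule violations for (state, ancestor) pointer pairs."""
    problems = []
    parent = {}
    for state, ancestor in pointers:
        if state == ancestor:
            problems.append(f"rule 2: {state} points at itself")
        if state in parent and parent[state] != ancestor:
            problems.append(f"rule 4: {state} points at more than one state")
        parent.setdefault(state, ancestor)
    roots = [s for s, a in parent.items() if a is None]
    if len(roots) != 1:
        problems.append(f"rule 1: expected exactly one root, found {len(roots)}")
    for start in parent:
        # follow the pointers; revisiting a state means the walk never reaches a root
        seen, state = set(), start
        while state is not None and state not in seen:
            seen.add(state)
            state = parent.get(state)
        if state is not None and len(seen) > 1:
            problems.append(f"rule 3: cyclic reference reachable from {start}")
            break
    return problems

print(check_cst([("A", None), ("B", "A"), ("C", "B"), ("D", "A")]))  # []
print(check_cst([("A", "B"), ("B", "A")]))
```

Rule 5 needs no check: several pairs may share the same ancestor without any conflict.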
• Matrioshka set


Our primary tool for calculations is the matrioshka set. It contains all the states we
visit if, starting from a state, we walk to the root (that is, all states that belong to
the path from this state to the root, the state itself included). As an example let us
take C from fig. 2b. The matrioshka set $M_C$ associated with C is $M_C = \{A, B, C\}$;
for D, $M_D = \{A, D\}$; for A, $M_A = \{A\}$.
There is no empty matrioshka set.
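Collecting a matrioshka set amounts to following the pointers until the root is reached. A minimal sketch, again assuming fig. 2b is stored as a state-to-ancestor mapping (PARENT and matrioshka_set are illustrative names):

```python
# Assumed encoding of fig. 2b; the root A points at nowhere (None).
PARENT = {"A": None, "B": "A", "C": "B", "D": "A"}

def matrioshka_set(state, parent):
    """All states on the path from `state` to the root, `state` included."""
    members = set()
    while state is not None:
        members.add(state)
        state = parent[state]
    return members

print(sorted(matrioshka_set("C", PARENT)))  # ['A', 'B', 'C']
print(sorted(matrioshka_set("D", PARENT)))  # ['A', 'D']
```

Because the starting state is always a member, the set can never be empty.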

The intersection of two matrioshka sets also has all the characteristics of a matrioshka
set. This means that the intersection contains one state that has all the
other elements of the set as ancestral states. Furthermore, all its ancestral states,
without exception, are elements of the set. The state with these characteristics is the
element with the highest matrioshka value.


• Matrioshka value


The matrioshka value of a state is the number of elements of its matrioshka set. In
fig. 2b the matrioshka value of C is $|M_C| = 3$, that of D is $|M_D| = 2$ and that of A is $|M_A| = 1$.
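The statement that an intersection of matrioshka sets is again a matrioshka set can be checked on the example tree. The sketch below assumes the same pointer encoding of fig. 2b as used in the text's examples; top_state is a hypothetical helper that picks the element with the highest matrioshka value.

```python
# Assumed encoding of fig. 2b; the root A points at nowhere (None).
PARENT = {"A": None, "B": "A", "C": "B", "D": "A"}

def matrioshka_set(state, parent):
    """All states on the path from `state` to the root, `state` included."""
    members = set()
    while state is not None:
        members.add(state)
        state = parent[state]
    return members

def matrioshka_value(state, parent):
    """Number of elements of the matrioshka set of `state`."""
    return len(matrioshka_set(state, parent))

def top_state(members, parent):
    """Element of `members` with the highest matrioshka value."""
    return max(members, key=lambda s: matrioshka_value(s, parent))

common = matrioshka_set("C", PARENT) & matrioshka_set("B", PARENT)
top = top_state(common, PARENT)
# The intersection equals the matrioshka set of its top element.
print(top, common == matrioshka_set(top, PARENT))  # B True
```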


Evolutionary distance of two character states 


As an example we compute the distance between the character states D and G of the
CST of fig. 3 (which has the same structure as the one of fig. 1). To do this we need the
path from D to the root A (fig. 3a) and the path from G to the root (fig. 3b). The distance
is the number of states that belong either to the path shown in fig. 3a or to the one shown
in fig. 3b, but not to both. The states that fulfil this criterion are shown in fig. 3c as
dark circles.







Fig. 3a-c: A CST with the same structure as the one of fig. 1. The paths to the root of the states D
(3a) and G (3b) are accentuated, as well as the elements of the set $S_{DG}$ (fig. 3c, see eq. 1).

Because all the states that belong to the path from a state to the root are also elements
of the matrioshka set of this state, we can compute the evolutionary distance $d_{AB}$ of
any two character states A and B by defining a set $S_{AB}$:

Eq. 1) $S_{AB} = (M_A \cup M_B) \setminus (M_A \cap M_B)$

Then $d_{AB} = |S_{AB}|$, the number of elements of $S_{AB}$.

In our example (fig. 3): $M_D = \{A, B, C, D\}$, $M_G = \{A, B, C, F, G\}$ and $S_{DG} = \{D, F, G\}$.
Therefore $d_{DG} = 3$.

Following these instructions, we count how many branches of the CST
separate two states. This is our definition of "evolutionary distance".
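Equation 1 is a symmetric difference of two sets, which most languages provide directly. The sketch below encodes only the part of fig. 3 that can be read from the two matrioshka sets quoted in the text; the remaining states of the figure are omitted, and the names PARENT, matrioshka_set and distance are our own.

```python
# Partial, assumed encoding of fig. 3: only the states on the paths of D and G.
PARENT = {"A": None, "B": "A", "C": "B", "D": "C", "F": "C", "G": "F"}

def matrioshka_set(state, parent):
    """All states on the path from `state` to the root, `state` included."""
    members = set()
    while state is not None:
        members.add(state)
        state = parent[state]
    return members

def distance(a, b, parent):
    """Evolutionary distance (eq. 1): size of the symmetric difference."""
    return len(matrioshka_set(a, parent) ^ matrioshka_set(b, parent))

print(sorted(matrioshka_set("D", PARENT) ^ matrioshka_set("G", PARENT)))  # ['D', 'F', 'G']
print(distance("D", "G", PARENT))  # 3
```

The work per distance is bounded by the lengths of the two paths, not by the number of state pairs.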
Cladogram (Taxon tree)


A cladogram is a reconstruction of the evolution from the last hypothetical common
ancestor to the taxa whose relationship we want to know. Each cladogram is thus
a hypothesis about the relationship of some taxa (e.g. species). To simplify the
following explanations, we first discuss the topology of cladograms. Its fundamental
structure is that of a binary dendrogram. It consists of:


1. nodes (n tips, which represent existing taxa, and n-1 inner nodes, which
symbolize hypothetical taxa).


2. 2(n-1) branches, each of which symbolizes the path of evolution between an existing
or hypothetical taxon and, in the direction of the root, a second node that
represents its ancestor (see also 3). The length of a branch is the evolutionary
distance of the nodes that are connected by the branch.


3. bifurcations. A bifurcation consists of three nodes and two branches. Two of the
nodes lie in the direction of the tips (they can be tips). The converging branches
connect them with the third one, the origin, which lies in the direction of the root.
The origin represents the immediate common ancestor.


4. the root, the hypothetical common ancestor of all nodes of the cladogram.
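The counts in items 1 and 2 can be verified on any binary tree. As an assumption for illustration only, a cladogram is written here as nested pairs, with strings as tips; the tip names T1 to T5 and the topology are hypothetical.

```python
def count_parts(node):
    """Return (tips, inner nodes, branches) of a binary cladogram."""
    if isinstance(node, str):  # a tip
        return 1, 0, 0
    left, right = node  # a bifurcation: one origin, two branches
    lt, li, lb = count_parts(left)
    rt, ri, rb = count_parts(right)
    return lt + rt, li + ri + 1, lb + rb + 2

TREE = (("T1", "T2"), (("T3", "T4"), "T5"))  # hypothetical topology
tips, inner, branches = count_parts(TREE)
print(tips, inner, branches)  # 5 4 8
```

With n = 5 tips this gives n-1 = 4 inner nodes and 2(n-1) = 8 branches.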


Orientation: to distinguish cladograms and CSTs easily, we draw cladograms from left 
(root) to right (tips) and CSTs from bottom (root) to top. 


Quality of the reconstruction of a cladogram 


It is necessary to estimate the quality of the reconstruction of the evolutionary course
that a cladogram represents, so that we can prefer a better cladogram to a worse one. Our
criterion of quality is the total number of necessary transitions from one character state
to another. The fewer "evolutionary steps" are necessary in the whole cladogram, the better,
which means that we are searching for the most parsimonious hypothesis (principle of
economy, CAMIN & SOKAL 1965).
Fig. 4a-b: (a) Cladogram for five species (the tips T1 to T5) and the character which is coded in
fig. 2b as CST. The inner nodes of the cladogram are hypothetical species. Their character state is
reconstructed by the algorithm described in the text. Prominent branches denote an evolutionary
step (a change of state). (b) Bifurcation of a cladogram with the character coded in fig. 3.
Fig. 4a gives an example of a cladogram for five species. The character states of the interior
nodes are already reconstructed. It is easy to see that, for each bifurcation, the states
accentuated in the CST of the bifurcation origin (dark circles) are those that belong to both
accentuated paths of the CSTs of the two nodes which lie in the direction of the tips. Thus, if
O is the state of the origin, and A and B are the states of the tipwards nodes, it follows:


Eq. 2) $M_O = M_A \cap M_B$


This has the consequence that the reconstructed state of the origin is not necessarily identical 
to the state of one of the tipwards nodes. Take as an example the CST of fig. 3 and the 
bifurcation of a cladogram (fig. 4b). If the tipwards nodes of any bifurcation of a cladogram 
have the states D and G, respectively, following eq. 2 we would assign the state C to the 
origin. 
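The bifurcation of fig. 4b can be worked through in code. The sketch below encodes only the part of fig. 3 that is implied by the sets given in the text; origin_state and branch_length are hypothetical helper names. It applies eq. 2 to find the origin and eq. 1 to measure the two converging branches.

```python
# Partial, assumed encoding of fig. 3: only the states on the paths of D and G.
PARENT = {"A": None, "B": "A", "C": "B", "D": "C", "F": "C", "G": "F"}

def matrioshka_set(state, parent):
    """All states on the path from `state` to the root, `state` included."""
    members = set()
    while state is not None:
        members.add(state)
        state = parent[state]
    return members

def origin_state(a, b, parent):
    """State of the bifurcation origin: top element of the intersection (eq. 2)."""
    common = matrioshka_set(a, parent) & matrioshka_set(b, parent)
    return max(common, key=lambda s: len(matrioshka_set(s, parent)))

def branch_length(node_state, origin, parent):
    """Length of the branch between a tipward node and its origin (eq. 1)."""
    return len(matrioshka_set(node_state, parent) ^ matrioshka_set(origin, parent))

origin = origin_state("D", "G", PARENT)
print(origin, branch_length("D", origin, PARENT), branch_length("G", origin, PARENT))  # C 1 2
```

The two branch lengths add up to 3, the distance between D and G, although C itself is observed at neither tipward node.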


Using matrioshka sets, the algorithm that counts the number of necessary
transitions for any evolutionary course, and thus for any cladogram, is very simple.
The procedure needs two steps:


1. For each character: starting with those bifurcations where both right (tipward) nodes are
tips, we first create the matrioshka sets of the states assigned to these tips. Next we calculate
the intersection of these matrioshka sets. The result is assigned to the basal node, the origin.
We then proceed to the bifurcations where the matrioshka sets of both right nodes are already
known; each of them is either the result calculated for a bifurcation further to the right or
belongs to a single tip. In doing so we move leftwards until we reach the bifurcation that has
the root as origin.


2. Using these sets we determine the length of each branch. As already mentioned, the length is
the evolutionary distance of the nodes that are connected by the branch. We can compute it
with the aid of equation 1, using the matrioshka sets that were assigned to the nodes in step
one.


We do this for all branches of the cladogram and add up the results. Finally we add up
the totals of all characters to reach our final aim, the quality
criterion of the cladogram.


The computational effort of this algorithm increases linearly with the number of character 
states. 
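The two steps can be sketched as one recursive pass over the cladogram. Everything specific in this example is an assumption made for illustration: the cladogram is written as nested pairs of hypothetical tip names T1 to T5, the first character uses the CST implied for fig. 2b, the second is an invented binary character, and the tip states are made up.

```python
def matrioshka_set(state, parent):
    """All states on the path from `state` to the root, `state` included."""
    members = set()
    while state is not None:
        members.add(state)
        state = parent[state]
    return members

def subtree(node, tip_state, parent):
    """Return (matrioshka set of the node, summed branch lengths below it)."""
    if isinstance(node, str):  # a tip: its set comes from the observed state
        return matrioshka_set(tip_state[node], parent), 0
    left_set, left_len = subtree(node[0], tip_state, parent)
    right_set, right_len = subtree(node[1], tip_state, parent)
    origin = left_set & right_set  # step 1 (eq. 2)
    steps = len(left_set ^ origin) + len(right_set ^ origin)  # step 2 (eq. 1)
    return origin, left_len + right_len + steps

def cladogram_length(tree, characters):
    """Sum of the branch lengths over all characters."""
    return sum(subtree(tree, tip_state, parent)[1] for parent, tip_state in characters)

TREE = (("T1", "T2"), (("T3", "T4"), "T5"))  # hypothetical cladogram
CHARACTERS = [
    ({"A": None, "B": "A", "C": "B", "D": "A"},  # CST assumed for fig. 2b
     {"T1": "C", "T2": "B", "T3": "D", "T4": "D", "T5": "A"}),
    ({"0": None, "1": "0"},  # invented binary character
     {"T1": "1", "T2": "1", "T3": "0", "T4": "0", "T5": "0"}),
]
print(cladogram_length(TREE, CHARACTERS))  # 4
```

In this example the first character contributes three steps (A to B, B to C, A to D) and the second one step, so the cladogram has length 4. Comparing this value across candidate topologies gives the quality criterion.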


Reconstruction of the course of evolution 


During step 1 we associated a matrioshka set to each node of the cladogram. The element of 
this set, which has the highest matrioshka — value is the state that must be associated to the 
node. 
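A sketch of this reconstruction for one character follows. It assumes, for brevity, that the tips of a cladogram are written directly as their observed states and that an annotated inner node is returned as a triple (state, left subtree, right subtree); the topology, the tip states and the name reconstruct are hypothetical, and the CST is the one implied for fig. 2b.

```python
PARENT = {"A": None, "B": "A", "C": "B", "D": "A"}  # CST assumed for fig. 2b

def matrioshka_set(state, parent):
    """All states on the path from `state` to the root, `state` included."""
    members = set()
    while state is not None:
        members.add(state)
        state = parent[state]
    return members

def reconstruct(node, parent):
    """Return (matrioshka set, annotated subtree) for `node`."""
    if isinstance(node, str):  # a tip keeps its observed state
        return matrioshka_set(node, parent), node
    left_set, left = reconstruct(node[0], parent)
    right_set, right = reconstruct(node[1], parent)
    origin = left_set & right_set
    # the element with the highest matrioshka value is the state of the node
    state = max(origin, key=lambda s: len(matrioshka_set(s, parent)))
    return origin, (state, left, right)

TREE = (("C", "B"), (("D", "D"), "A"))  # hypothetical tip states
print(reconstruct(TREE, PARENT)[1])
```

For this input the root and the ancestor of the D pair with A receive state A, the ancestor of C and B receives B, and the ancestor of the two D tips receives D.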


Summary


Even the very first attempts to arrive at phylogenetic reconstructions by means of algorithms
used character state trees as an important tool. This changed with the shift of interest towards
molecular biological data, because for such data the distinction between "more primitive" and
"more derived" characters usually cannot be made. In addition, no simple way of representing
character state trees in taxon/character matrices existed at that time.

Here we describe a new algorithm for Camin-Sokal parsimony (principle of economy) and some new
procedures that on the one hand simplify computation and on the other hand make it possible to
represent character states within a taxon/character matrix even for character state trees of any
complexity. At present the method is of course mainly of interest for morphological characters,
but it could also prove useful for cladograms that use combined data sets, including, for
example, molecular biological data. Software that uses the new procedures is available as
freeware from the corresponding author.
Appendix I


As a commemorative publication for the 70th birthday of the famous ornithologist Oskar
Heinroth, Konrad Lorenz, one of the founders of ethology and Nobel Prize winner of 1973,
published in 1941 a voluminous article showing that behavioural patterns can be
genetically fixed and hence can be used in systematics just as well as morphological
characters. This proof was immediately recognized as important. In the same publication,
more or less in passing, Lorenz invented a method for creating a phylogenetic
tree from a species/characters matrix (48 mainly behavioural characters from
twenty anatid species, fig. 1). This method is now known as tree popping and was
reinvented (in a very derived version) 40 years later by Meacham (MEACHAM 1981).
Fig. 1 (from LORENZ 1941): Tree resulting from a species/characters matrix of 48 mainly
behavioural characters from twenty anatid species (in fact, this graphic is matrix and tree in one).
In the publication of Lorenz 1941, most of the characters were binary. His algorithm
started with a sorting process. Those characters for which most of the species showed the
derived state were used first. The species that showed the derived state were bundled
together. Lorenz of course did not use a computer in 1941; instead he used a vertically
oriented, thick wire for each species. Those species that shared a derived state were
bundled with a horizontally oriented, thin wire that represents a character. In this way
the reconstruction shown in fig. 1 emerged character by character or, respectively,
state by state.


The lines with letters define the characters (for an explanation of the characters see
LORENZ (1941)), the numbers define the species. Horizontal lines: common, derived character
states. As can be seen, only derived states are used for clustering (with two exceptions).
Vertical lines lead to species. Crosses: missing characters; circles: special
differentiation of the characters; question mark: lack of knowledge. The species names
from LORENZ (1941) are as follows:


1: Cairina moschata, 2: Lampronessa sponsa, 3: Aix galericulata, 4: Mareca sibilatrix,
5: Mareca penelope, 6: Chaulelasmus strepera, 7: Nettion crecca, 8: Nettion flavirostre,
9: Virago castanea, 10: Anas spp., 11: Dafila spinicauda, 12: Dafila acuta,
13: Poecilonetta bahamensis, 14: Poecilonetta erythrorhyncha, 15: Querquedula querquedula,
16: Spatula clypeata, 17: Tadorna tadorna, 18: Casarca ferruginea, 19: Anser spp.,
20: Branta spp.


The names of 1, 3, 10, 17, 19 and 20 are valid. The valid names of the others are: 2: Aix
sponsa, 4: Anas sibilatrix, 5: Anas penelope, 6: Anas strepera, 7: Anas crecca, 8: Anas
flavirostris, 9: Anas castanea, 11: Anas georgica spinicauda, 12: Anas acuta, 13: Anas
bahamensis, 14: Anas erythrorhyncha, 15: Anas querquedula, 16: Anas clypeata, 18:
Tadorna ferruginea.


Fig. 1 is a tree and a species/characters matrix in one. Thus we can transform fig. 1 into
a species/characters matrix with matrioshka coding (tab. 1) and use it to create a
Camin-Sokal parsimony cladogram (fig. 2) with the aid of our software (PYRE Classic,
for PhYlogenetic REconstruction using character state trees).


Because Lorenz gave a very accurate description of the characters in his article, it is
possible to combine the simple characters into more complex character state trees. Instead
of 48 characters we then get only 18 (tab. 2a and 2b). We can use tab. 2b for Camin-Sokal
reconstruction too (fig. 3). Because fewer characters have to be used, the software arrives
at a result earlier than with the data of tab. 1. The result is of course virtually the same
(if the data transformation were perfect, it would be identical; in some cases, however, the
text of the article and fig. 1 contradict each other). Furthermore, fig. 2 leads more or
less to the same result as fig. 1.
Fig. 2: Result of a Camin-Sokal parsimony phylogenetic reconstruction using the data from tab. 1.
Fig. 3: Result of a Camin-Sokal parsimony phylogenetic reconstruction using the data from tab. 2a.
Tab. 1: Species/characters matrix derived from fig. 1 (continuation).
Surviving column labels: Poecilonetta bahamensis (13), Cairina moschata (1), Mareca
sibilatrix (4), Nettion flavirostre (8), Spatula clypeata (16), Virago castanea (9),
Anser spp., Branta spp.
Tab. 2b: Description of the new character state trees (with continuations).

Recoverable state assignments (state: species numbers):

II    0: 1-3, 17-20;  A: 4-16
IV    0: 1-8, 11-20;  A: 9, 10
V     A: 1-3;  B: 4-16;  C: 17-20
(character label lost)  A: 1, 12, 14, 17-20;  B: 4, 5, 7-11, 13;  C: 6;  D: 2, 3;  E: 15, 16

Species-number groups without a recoverable state label: 1-3, 11-18; 1, 19, 20; 14, 17, 18;
2-10; 11, 12; 13; 15, 16.

Characters (abbreviations as extracted): 2ST, Mkst, Kh, TrKh, PiH, Skh, Hv, Rr, DC, Ns, HE, PE,
Antr, Sp, Bfk, Spf, EIS, Gg, Ges, KnTr, Epf, Kd, Gg, OP, Hkz, Ar, Pn, GISp, Ss, Spi, Fz, LS,
Sz, Ssn.